In most situations it is infeasible to have new data on hand to test a trained model. Therefore an in-house test will employ part of the available data to mimic a real-world out-house test on new data. This part of the available data is commonly reserved from the whole available data and is called a testing data set. In contrast, the data used to construct a model is called a training data set. The final or ultimate evaluation of a trained machine learning model must be based on the performance of a generalisation test applied to a testing data set.

A common exercise for the generalisation test of a supervised machine learning model is to randomly divide a collected data set into two parts. One of them is used as the training data set and the other is reserved as the testing data set. However, it can be difficult to guarantee a robust division of the available data so as to deliver an unbiased generalisation performance test. Robustness here means whether such a random division can generate a testing data set which represents the data distribution of the training data set. This is a very important point to address, as it is the key to a successful generalisation test process.
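The random division described above can be sketched in a few lines of Python. This is a minimal illustration, not the book's own code; the `test_fraction` parameter, the seeded shuffle, and the function name are illustrative choices.

```python
import random

def random_split(data, test_fraction=0.2, seed=None):
    """Randomly divide a data set into a training part and a testing part."""
    rng = random.Random(seed)
    shuffled = list(data)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    # The reserved part becomes the testing data set; the rest trains the model.
    return shuffled[n_test:], shuffled[:n_test]

training, testing = random_split(range(100), test_fraction=0.2, seed=7)
print(len(training), len(testing))  # 80 20
```

Note that a single such split gives no guarantee of robustness: whether the testing part represents the distribution of the training part still has to be checked or averaged out over repeated splits.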

Three approaches have been used for the generalisation test of a supervised machine learning model so far. They are the random sampling approach, the K-fold cross-validation approach [Devijver and Kittler, 1982; Bishop, 1996] and the Jackknife test [Quenouille, 1949; Efron and Stein, 1981].
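For orientation, the index bookkeeping behind K-fold cross-validation can be sketched as follows. This is a generic illustration under assumed names, not the method as given in the original text; setting K equal to the number of samples yields the leave-one-out scheme used by the Jackknife test.

```python
def k_fold_splits(n_samples, k):
    """Return K (training_indices, testing_indices) pairs covering all samples."""
    indices = list(range(n_samples))
    splits, start = [], 0
    for i in range(k):
        # Spread any remainder over the first few folds.
        size = n_samples // k + (1 if i < n_samples % k else 0)
        testing = indices[start:start + size]
        training = indices[:start] + indices[start + size:]
        splits.append((training, testing))
        start += size
    return splits

for training, testing in k_fold_splits(10, 5):
    print(testing)  # each sample appears in exactly one testing fold
```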

The random sampling approach is an application of the Monte Carlo simulation [Metropolis, 1987]. The basic principle of the Monte Carlo simulation is to draw samples from a population many times to estimate a parameter. It is believed that as the number of samplings increases, the estimated parameter approaches the true one. Therefore, a parameter can be well-estimated if sufficient sampling times can be guaranteed. To demonstrate how this works, a one-dimensional Gaussian distribution centred at zero was generated with one million data points. Samples were then drawn from this population, with sizes varying from ten to 10,000, to estimate the mean of the Gaussian distribution. Figure 3.13 shows the simulation result. It can be seen that the estimated mean of the Gaussian distribution gradually converged to the true sample mean, i.e., zero.
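The experiment behind Figure 3.13 can be reproduced in outline as below. The unit standard deviation and the random seed are assumptions, since the original text only specifies a zero-centred Gaussian population of one million points.

```python
import random
import statistics

def estimate_mean(population, n, rng):
    """Estimate the population mean from a single random sample of size n."""
    return statistics.fmean(rng.sample(population, n))

rng = random.Random(1)
# Population: one million points from a Gaussian centred at zero (sigma = 1 assumed).
population = [rng.gauss(0.0, 1.0) for _ in range(1_000_000)]

# As the sample size grows, the estimate drifts towards the true mean of zero.
for n in (10, 100, 1_000, 10_000):
    print(f"n = {n:>6}: estimated mean = {estimate_mean(population, n, rng):+.4f}")
```

Averaging many such estimates at each sample size, as the figure does, makes the convergence trend smoother than any single run.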